PROCEDURE FOR ANALYSING THE VIROME DATA OUTPUT BY THE PIPELINE
Before starting
Reference web pages for R Markdown (to go further):
https://juba.github.io/tidyverse/13-rmarkdown.html
https://bookdown.org/yihui/rmarkdown/documents.html
https://stt4230.rbind.io/communication_resultats/redaction_r_markdown/#r-markdown
0- Load libraries
Libraries required by this R script (install them first):
# To read csv files
library("readr")
# To manipulate data (automatically loads the dplyr and tidyr packages)
library("tidyverse")
library("dplyr")
library("tidyr")
# For almost all figures
library("ggplot2")
# To read Excel files
library("readxl")
# Another data-manipulation package
library("gdata")
# For interactive tables
library("DT")
# Statistical tests, e.g. the Wilcoxon signed rank test (function "wsrTest")
library("asht")
# "superheat" heatmaps
library("superheat")
# "ComplexHeatmap" heatmaps
library("ComplexHeatmap")
# Typographic themes for ggplot (function "theme_ipsum")
library("hrbrthemes")
# To embed statistical tests directly in ggplot
library("ggsignif")
# Function "glue"
library("glue")
# Venn diagrams
library("ggvenn")
# Function "PieDonut"
require("moonBook")
# Function "PieDonut"
require("webr")
# To work with grid-based graphics
library("gridExtra")
# Function "melt"
library("reshape2")
# To combine plots into composite figures
library("patchwork")
# 3D graphics
library("rgl")
# Community ecology analyses (functions "specnumber", "metaMDS", "vegdist")
library("vegan")
# Indicator species analysis
library("indicspecies")
1- Summary of the filters applied to the raw reads during the bioinformatic analysis
The undetermined reads are used for the co-assembly step and then discarded; they are not included in the counts because they cannot be assigned to a sample.
- duplicats
- filt_ctrl = subtraction, for each OTU, of the reads of the most loaded control
- filt_10X = minimum of 10 reads per sample (< 10 reads per OTU set to 0)
- filt_100X = minimum of 100 reads per OTU
- filt_USP = removal of bacterial/bacteriophage/fungal reads and unwanted viruses (e.g. PPR)
- clustering = resolves multi-assignments: grouping of TaxIDs by the 5 best hits (5 outputs kept per contig)
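As an illustration, the control-subtraction and read-threshold filters can be sketched on a toy count matrix in base R (hypothetical data; the pipeline applies these filters upstream):

```r
# Toy count matrix: rows = OTUs, columns = two samples plus one control
counts <- matrix(c(120, 500,  8,
                    15,   3, 60,
                   950,  40,  2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("OTU1", "OTU2", "OTU3"),
                                 c("S1", "S2", "Ctrl")))

# filt_ctrl: subtract, for each OTU, the reads of the most loaded control
filtered <- sweep(counts[, c("S1", "S2")], 1, counts[, "Ctrl"])
filtered[filtered < 0] <- 0

# filt_10X: counts below 10 reads per sample are set to 0
filtered[filtered < 10] <- 0

# filt_100X: drop OTUs with fewer than 100 reads in total
filtered <- filtered[rowSums(filtered) >= 100, , drop = FALSE]
filtered   # OTU2 is removed by the successive filters
```

The order matters: the control subtraction is applied before the per-sample and per-OTU thresholds, as in the recap above.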
Table_data <- read_excel(Table_data.file)
Table_data <- as.data.frame(Table_data)
rownames(Table_data) <- Table_data$Object
Table_data <- Table_data[,-c(1:3)]
Table_data <- as.data.frame(t(Table_data))
Table_data$Filter <- rownames(Table_data)
# Basic Barplot : reads number
my_bar <- barplot(Table_data$reads_number, border=F , names.arg=Table_data$Filter,
las=1 ,
col=c(rgb(0.3,0.1,0.4,0.6) , rgb(0.3,0.5,0.4,0.6) , rgb(0.3,0.9,0.4,0.6) , rgb(0.8,0.7,0.4), rgb(0.8,0.7,1)) ,
ylim=c(0,700000) ,
main="Impact of filters on reads number -28594 reads -4.4%",
ylab = "reads", xlab="Filter")
# Add the text
text(my_bar, Table_data$reads_number+15000 , paste("n: ", Table_data$reads_number, sep="") ,cex=1)
# Basic Barplot : contigs
my_bar_contigs <- barplot(Table_data$contigs_number, border=F , names.arg=Table_data$Filter,
las=1 ,
col=c(rgb(0.3,0.1,0.4,0.6) , rgb(0.3,0.5,0.4,0.6) , rgb(0.3,0.9,0.4,0.6) , rgb(0.8,0.7,0.4), rgb(0.8,0.7,1)) ,
ylim=c(0,620) ,
main="Impact of filters on contigs number -329 contigs -57.5%",
ylab = "contigs", xlab="Filter")
# Add the text
text(my_bar_contigs, Table_data$contigs_number+10 , paste("n: ", Table_data$contigs_number, sep="") ,cex=1)
# Basic Barplot : vOTUs
my_bar_otu <- barplot(Table_data$otu_number, border=F , names.arg=Table_data$Filter,
las=1 ,
col=c(rgb(0.3,0.1,0.4,0.6) , rgb(0.3,0.5,0.4,0.6) , rgb(0.3,0.9,0.4,0.6) , rgb(0.8,0.7,0.4), rgb(0.8,0.7,1)) ,
ylim=c(0,270) ,
main="Impact of filters on vOTUs number -154 vOTUs -61.4%",
ylab = "vOTUs", xlab="Filter" )
# Add the text
text(my_bar_otu, Table_data$otu_number+5 , paste("n: ", Table_data$otu_number, sep="") ,cex=1)
At the output of the pipeline there are 6 files:
- Stat_by seqid = contigs and seqid for each library, with the read count
- seq_hits_info = statistics for each contig
- Tax_table = viral taxonomy
- count_table = number of reads per library (used to build the presence/absence table)
- OTU_stat = statistics for each Magroupe
- Viral_contigs = fasta file of the contig reads
The Metadata file must be added (it gathers all the information for each library).
Keep the count_table from before filtering (the raw data).
2- Sequencing analyses
2-1: General statistics / descriptive statistics
Experimental design: number of libraries plus the controls and undetermined reads for the datasets that have them (not visible in the Senegal dataset shown here)
Analysis of the run from the stats table: example below (Senegal data)
datatable(read_csv(Stats_run.file), editable = 'cell', filter = 'top')
2-2: Checking data homogeneity (to spot handling biases, e.g. if many extraction plates were used)
Total reads per sample
Viral reads per sample
Reads as a function of the different variables
Reads of the controls and viral diversity?
Analysis of the read distribution: do the data follow a normal distribution?
Shapiro test
count_otu_Cx <- read_csv(Count_table.file)
apply(count_otu_Cx[,-1], 2, shapiro.test) # A p-value > 0.05 indicates that the data distribution is not significantly different from a normal distribution; in other words, normality can be assumed (which is not the case here).
## $`Sample_S-16-0141_7`
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.28161, p-value < 2.2e-16
##
##
## $`Sample_S-16-0148_4`
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.35812, p-value < 2.2e-16
##
##
## $`Sample_S-16-0143_3`
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.24563, p-value < 2.2e-16
##
##
## $`Sample_S-16-0144_1`
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.42619, p-value < 2.2e-16
##
##
## $`Sample_S-16-0147_6`
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.14888, p-value < 2.2e-16
##
##
## $`Sample_S-16-0149_2`
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.26348, p-value < 2.2e-16
##
##
## $`Sample_S-16-0139_4`
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.27444, p-value < 2.2e-16
##
##
## $`Sample_S-16-0142_5`
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.18673, p-value < 2.2e-16
##
##
## $`Sample_S-16-0146_8`
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.23702, p-value < 2.2e-16
##
##
## $`Sample_S-16-0145_0`
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.38746, p-value < 2.2e-16
##
##
## $`Sample_S-16-0140_9`
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.33753, p-value < 2.2e-16
##
##
## $`Sample_S-16-0138_6`
##
## Shapiro-Wilk normality test
##
## data: newX[, i]
## W = 0.16604, p-value < 2.2e-16
The normal distribution is characterised by its mean and variance. If the data are normally distributed, use PARAMETRIC TESTS; otherwise use NON-PARAMETRIC TESTS (non-parametric tests make no assumption about the distribution of the observations; they are based on ranks).
Use the Kruskal-Wallis test for non-parametric data (with n > 2 groups).
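As a minimal example (toy read counts and hypothetical site labels), kruskal.test() from base R compares more than two groups without assuming normality:

```r
# Toy viral read counts for three hypothetical sites (skewed, non-normal data)
reads <- c(12, 45, 30, 900, 1500, 700, 80, 60, 95)
site  <- factor(rep(c("Ferlo", "Diama", "KMS"), each = 3))

# Kruskal-Wallis rank sum test: H0 = all groups come from the same distribution
kw <- kruskal.test(reads ~ site)
kw$p.value  # a p-value < 0.05 suggests at least one site differs
```

For two groups only, wilcox.test() (used further below on the distance matrices) plays the same role.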
2-3: Taxonomy analysis
Viral taxonomy follows the ICTV classification.
Taxonomic assignment of the contigs is done by BLAST against the 2021 databases, but a "manual" correction is needed: fix the NA values in the Phylum, Order, Nucleic_acid and Host columns according to Shi et al., Koonin et al. and the ICTV.
The Cluster rank is intermediate between Order and Family.
2-3-1: Cleaning the NA values in the TAX_Table file: putting things back in order
- Harmonise the naming of the Host column. We often find Culicidae/Diptera/Insecta/Insecta_other/other diptera. Depending on the species name, the value can be changed to Culicidae, Arthropod_other or insecta_other (remaining NA = Arthropoda in the Senegal dataset; example of the manipulation to perform)
taxo_table_Cx <- read_excel(Taxo_table.file)
taxo_table_Cx <- taxo_table_Cx[,-1]
taxo_table_Cx[is.na(taxo_table_Cx$hosts_taxon), "hosts_taxon"] <- "Arthropoda_other"
- Phylum
RNA or DNA corrected to cluster_phylum where applicable
taxo_table_Cx[unique(grep("RNA", taxo_table_Cx$phylum)), colnames(taxo_table_Cx) %in% "phylum"] <- "cluster_phylum"
taxo_table_Cx[unique(grep("DNA", taxo_table_Cx$phylum)), colnames(taxo_table_Cx) %in% "phylum"] <- "cluster_phylum"
- Order
NA corrected to cluster_order where applicable
taxo_table_Cx[is.na(taxo_table_Cx$order), "order"] <- "cluster_order"
- Cluster
Permutotetra_unclass => Permutotetra
taxo_table_Cx[unique(grep("Permutotetra_unclass", taxo_table_Cx$cluster)), colnames(taxo_table_Cx) %in% "cluster"] <- "Permutotetra"
- Cluster
Luteo-Sobemo => Toli_unclass
taxo_table_Cx[unique(grep("Luteo-Sobemo", taxo_table_Cx$cluster)), colnames(taxo_table_Cx) %in% "cluster"] <- "Toli_unclass"
- Phylum
Luteo-Sobemo => Kitrinoviricota
taxo_table_Cx[unique(grep("Luteo-Sobemo", taxo_table_Cx$phylum)), colnames(taxo_table_Cx) %in% "phylum"] <- "Kitrinoviricota"
- Order
Luteo-Sobemo => Tolivirales
taxo_table_Cx[unique(grep("Luteo-Sobemo", taxo_table_Cx$order)), colnames(taxo_table_Cx) %in% "order"] <- "Tolivirales"
This correction is done via the TaxID by looking up the information on the NCBI website. If nothing is found, check the BLAST of the contigs and take the taxonomy of the closest species.
2-3-2: Removing OTUs
In the Host column, Bacteria/Fungi/Viridiplantae/Vertebrates can be removed.
In the Order column, Caudovirales and Levivirales can be removed because they are bacteriophages.
count_otu_taxo_Cx <- merge(count_otu_Cx, taxo_table_Cx, by = "tax_id")
count_otu_taxo_Cx <- count_otu_taxo_Cx %>%
subset(Nucleic_acid != "ssRNA-RT") %>%
subset(hosts_taxon != "Fungi") %>%
subset(hosts_taxon != "Vertebrate") %>%
subset(hosts_taxon != "Viridiplantae") %>%
subset(hosts_taxon != "Bacteria") %>%
subset(order != "Caudovirales") %>%
subset(order != "Levivirales")
2-3-3: Removing outlier libraries
Depending on the research questions, OTUs present in fewer than 5 libraries can be removed (when there are more than 100 libraries) -> not needed for Senegal
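A minimal sketch of that prevalence filter in base R (hypothetical counts and names):

```r
# Hypothetical OTU-by-library count matrix
otu_counts <- matrix(c(5, 0, 3, 8, 2, 1,    # OTU1: present in 5 libraries
                       0, 0, 0, 40, 0, 0,   # OTU2: present in 1 library
                       7, 9, 1, 2, 6, 4),   # OTU3: present in 6 libraries
                     nrow = 3, byrow = TRUE,
                     dimnames = list(paste0("OTU", 1:3), paste0("Lib", 1:6)))

# Prevalence = number of libraries in which the OTU has at least one read
prevalence <- rowSums(otu_counts > 0)

# Keep OTUs detected in at least 5 libraries
kept <- otu_counts[prevalence >= 5, , drop = FALSE]
rownames(kept)
```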
Removal of libraries with no viral reads
# Rename the samples
count_otu_taxo_Cx <- rename.vars(count_otu_taxo_Cx, from = c("Sample_S-16-0138_6","Sample_S-16-0139_4","Sample_S-16-0140_9","Sample_S-16-0141_7","Sample_S-16-0142_5","Sample_S-16-0143_3","Sample_S-16-0144_1","Sample_S-16-0145_0","Sample_S-16-0146_8","Sample_S-16-0147_6","Sample_S-16-0148_4","Sample_S-16-0149_2"), to = c("PF4","PF5","TF4","TF5","PD4","PD5","TD4","TD5","PK4","PK5","TK4","TK5"))
# name : P = C.Poicilipes, T = C.tritaeniorhynchus, F = Ferlo - Ponds, D = Diama - Delta, K = KMS - Lake, 4 = 2014, 5 = 2015
# Remove libraries without viral reads
count_otu_taxo_Cx$sum_reads <- rowSums(count_otu_taxo_Cx[,colnames(count_otu_taxo_Cx) %in% c("PF4","PF5","TF4","TF5","PD4","PD5","TD4","TD5","PK4","PK5","TK4","TK5")])
count_otu_taxo_Cx <- count_otu_taxo_Cx %>%
subset(sum_reads > 0)
- Analysing viral reads as a function of the number of OTUs (presence/absence OTU file, total summed per sample) makes it possible to set a threshold and remove outlier libraries when enough libraries are available. Analyse the controls (if controls are present).
rownames(count_otu_taxo_Cx) <- count_otu_taxo_Cx$species
abund_OTU_transvers_Cx <- as.data.frame(t(count_otu_taxo_Cx[,colnames(count_otu_taxo_Cx) %in% c("PF4","PF5","TF4","TF5","PD4","PD5","TD4","TD5","PK4","PK5","TK4","TK5")]))
abund_OTU_transvers_Cx$spe_richness <- vegan::specnumber(abund_OTU_transvers_Cx)
abund_OTU_transvers_Cx$nbr_reads <- colSums(count_otu_taxo_Cx[,colnames(count_otu_taxo_Cx) %in% c("PF4","PF5","TF4","TF5","PD4","PD5","TD4","TD5","PK4","PK5","TK4","TK5")])
cor.test(abund_OTU_transvers_Cx$nbr_reads,abund_OTU_transvers_Cx$spe_richness, method="pearson", alternative="two.sided")
##
## Pearson's product-moment correlation
##
## data: abund_OTU_transvers_Cx$nbr_reads and abund_OTU_transvers_Cx$spe_richness
## t = 5.7114, df = 10, p-value = 0.0001952
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6044174 0.9644993
## sample estimates:
## cor
## 0.8748533
plot(abund_OTU_transvers_Cx$nbr_reads,abund_OTU_transvers_Cx$spe_richness, xlab = "nbr_reads", ylab= "OTUs", cex = 1.5, pch = 19)
text(20000,55,"cor.pearson = 0.87", font = 4, col = "blue")
- Analysing viral reads as a function of the number of mosquitoes in each library
abund_OTU_transvers_Cx$Sample <- rownames(abund_OTU_transvers_Cx)
abund_OTU_transvers_Cx <- abund_OTU_transvers_Cx[sort(abund_OTU_transvers_Cx$Sample),]
# Mosquito number
abund_OTU_transvers_Cx$nbr_mos <- c(94, 551, 112, 103, 786, 383, 3829,3921, 98, 85, 4062, 3185)
# Species
abund_OTU_transvers_Cx$Host <- rep(c("C. poicilipes", "C. tritaeniorhynchus"), each=6, times=1)
# Sites
abund_OTU_transvers_Cx$Site <- rep(c("Diama", "Ferlo","KMS"), times = 2, each = 2)
# Year
abund_OTU_transvers_Cx$Year <- rep(c("2014", "2015"), each=1, times=6)
cor.test(abund_OTU_transvers_Cx$nbr_mos,abund_OTU_transvers_Cx$nbr_reads, method="pearson", alternative="two.sided")
##
## Pearson's product-moment correlation
##
## data: abund_OTU_transvers_Cx$nbr_mos and abund_OTU_transvers_Cx$nbr_reads
## t = 0.30237, df = 10, p-value = 0.7686
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5063796 0.6344288
## sample estimates:
## cor
## 0.0951835
plot(abund_OTU_transvers_Cx$nbr_reads,abund_OTU_transvers_Cx$nbr_mos, xlab = "nbr_reads", ylab= "nbr_mos", cex = 1.5, pch = 19)
text(20000,2000,"cor.pearson = 0.1", font = 4, col = "blue")
- Analysis of the overall composition (number of phyla / clusters / OTUs)
# cluster
unique(count_otu_taxo_Cx$cluster)
## [1] "Reo" "Rhabdo" "Birna" "Nege"
## [5] "Mesoni" "Solemo" "Phasma" "Chu"
## [9] "Narna" "Chryso" "Toli_unclass" "Noda"
## [13] "Partiti" "Permutotetra" "Dicistro" "Kelp_fly_cluster"
## [17] "Luteo" "Tombus" "Virga" "Phenui"
## [21] "Ifla" "Polycipi" "Picorna_unclass" "Mononega_unclass"
## [25] "Xinmo" "Toti"
table(count_otu_taxo_Cx$cluster)
##
## Birna Chryso Chu Dicistro
## 2 4 2 6
## Ifla Kelp_fly_cluster Luteo Mesoni
## 4 1 4 2
## Mononega_unclass Narna Nege Noda
## 1 3 6 7
## Partiti Permutotetra Phasma Phenui
## 7 3 1 5
## Picorna_unclass Polycipi Reo Rhabdo
## 1 2 8 9
## Solemo Toli_unclass Tombus Toti
## 6 1 3 5
## Virga Xinmo
## 5 1
# order
unique(count_otu_taxo_Cx$order)
## [1] "Reovirales" "Mononegavirales" "Birna_order"
## [4] "Martellivirales" "Nidovirales" "Sobelivirales"
## [7] "Bunyavirales" "Jingchuvirales" "Wolframvirales"
## [10] "Ghabrivirales" "Tolivirales" "Nodamuvirales"
## [13] "Durnavirales" "Permutotetra_order" "Picornavirales"
table(count_otu_taxo_Cx$order)
##
## Birna_order Bunyavirales Durnavirales Ghabrivirales
## 2 6 7 9
## Jingchuvirales Martellivirales Mononegavirales Nidovirales
## 2 11 11 2
## Nodamuvirales Permutotetra_order Picornavirales Reovirales
## 7 3 14 8
## Sobelivirales Tolivirales Wolframvirales
## 6 8 3
# Phylum
unique(count_otu_taxo_Cx$phylum)
## [1] "Duplornaviricota" "Negarnaviricota" "Incertae sedis" "Kitrinoviricota"
## [5] "Pisuviricota" "Lenarviricota"
table(count_otu_taxo_Cx$phylum)
##
## Duplornaviricota Incertae sedis Kitrinoviricota Lenarviricota
## 17 6 25 3
## Negarnaviricota Pisuviricota
## 19 29
PS: if working on clusters, do the following:
# To obtain a count table for the clusters
count_cluster_Cx <- aggregate( . ~ cluster, data = count_otu_taxo_Cx[,colnames(count_otu_taxo_Cx) %in% c("PF4","PF5","TF4","TF5","PD4","PD5","TD4","TD5","PK4","PK5","TK4","TK5", "cluster")], function(x) x=sum(x))
rownames(count_cluster_Cx) <- count_cluster_Cx$cluster
count_cluster_Cx$sum_reads <- apply(count_cluster_Cx[,!colnames(count_cluster_Cx) %in% ("cluster")], 1, sum)
datatable(count_cluster_Cx, editable = 'cell', filter = 'top')
For the rest, the script and manipulations are the same.
2.5: Visualising the structure with ordination and heatmaps
Test several distance methods; compare the Bray-Curtis and Jaccard indices (the Phyloseq package can also be used).
Represent with a heatmap (Phyloseq or other packages: ComplexHeatmap, pheatmap, superheat, ...)
HT1 <- Heatmap(as.matrix(count_otu_taxo_Cx[,2:13]), name = "relative abundance",
row_names_gp = gpar(fontsize = 6),
column_names_side = "top",
column_names_rot = 0,
heatmap_legend_param = list(direction = "horizontal"),
row_names_side = "left",
row_order = order(-count_otu_taxo_Cx$sum_reads),
top_annotation = HeatmapAnnotation(species = c("C. tritaeniorhynchus","C. tritaeniorhynchus","C. poicilipes","C. tritaeniorhynchus","C. poicilipes","C. tritaeniorhynchus","C. poicilipes","C. poicilipes","C. poicilipes","C. tritaeniorhynchus","C. tritaeniorhynchus","C. poicilipes"),
col = list(species = c("C. poicilipes" = "#F8766D", "C. tritaeniorhynchus"="#00BFC4"),
which = "column"),
show_legend = F),
rect_gp = gpar(col = "white", lwd = 1),
column_names_gp = gpar(col = c("#00BFC4","#00BFC4","#F8766D","#00BFC4","#F8766D","#00BFC4","#F8766D","#F8766D","#F8766D","#00BFC4","#00BFC4","#F8766D"), font = 2))
draw(HT1, heatmap_legend_side = "bottom", annotation_legend_side = NULL)
- Test several ordination methods by computing a distance matrix on the abundance table (MDS, nMDS, PCoA, ...).
MDS (multidimensional scaling) refers to a family of techniques that represent the data of a proximity matrix between objects using spatial distance models. MDS tries to represent the samples in 2 dimensions while preserving the distances between them.
nMDS (non-metric multidimensional scaling) places the points according to the distances computed with the chosen method (bray/jaccard/sorensen) and represents the resemblance relationships between objects as well as possible. In non-metric nMDS, the aim is to preserve the rank order of the proximities rather than their absolute or relative values. In other words, the goal is to represent the distances between objects while respecting the order of the proximities rather than their exact values.
PCoA is equivalent to a principal component analysis (PCA) but preserves beta diversity instead of variance. PCoA represents the groups using the distance matrix in order to cluster the samples according to their "association".
Example nMDS/Bray-Curtis, SENEGAL data
set.seed(123)
spe.nmds = metaMDS(abund_OTU_transvers_Cx[,!colnames(abund_OTU_transvers_Cx) %in% c("spe_richness", "nbr_reads","Sample","nbr_mos", "shannon", "simpson", "Host", "Site", "Year")], distance = "bray")
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 0.1460129
## Run 1 stress 0.1460129
## ... New best solution
## ... Procrustes: rmse 2.583069e-05 max resid 4.877324e-05
## ... Similar to previous best
## Run 2 stress 0.1460129
## ... Procrustes: rmse 9.320214e-05 max resid 0.0002102037
## ... Similar to previous best
## Run 3 stress 0.1551606
## Run 4 stress 0.1551607
## Run 5 stress 0.1551611
## Run 6 stress 0.1460129
## ... Procrustes: rmse 2.663715e-05 max resid 5.256511e-05
## ... Similar to previous best
## Run 7 stress 0.1460129
## ... Procrustes: rmse 1.112265e-05 max resid 2.221586e-05
## ... Similar to previous best
## Run 8 stress 0.2682926
## Run 9 stress 0.347306
## Run 10 stress 0.1460129
## ... Procrustes: rmse 6.05382e-05 max resid 0.000144601
## ... Similar to previous best
## Run 11 stress 0.146013
## ... Procrustes: rmse 0.0001782006 max resid 0.0004172081
## ... Similar to previous best
## Run 12 stress 0.1551613
## Run 13 stress 0.1460129
## ... New best solution
## ... Procrustes: rmse 1.98754e-05 max resid 4.765487e-05
## ... Similar to previous best
## Run 14 stress 0.1551607
## Run 15 stress 0.1460129
## ... Procrustes: rmse 9.573243e-06 max resid 1.776777e-05
## ... Similar to previous best
## Run 16 stress 0.1460129
## ... Procrustes: rmse 0.0001178876 max resid 0.0002805994
## ... Similar to previous best
## Run 17 stress 0.2036297
## Run 18 stress 0.1460129
## ... New best solution
## ... Procrustes: rmse 4.80723e-06 max resid 9.211136e-06
## ... Similar to previous best
## Run 19 stress 0.1460129
## ... Procrustes: rmse 2.406251e-05 max resid 4.170685e-05
## ... Similar to previous best
## Run 20 stress 0.1551609
## *** Solution reached
spe.nmds
##
## Call:
## metaMDS(comm = abund_OTU_transvers_Cx[, !colnames(abund_OTU_transvers_Cx) %in% c("spe_richness", "nbr_reads", "Sample", "nbr_mos", "shannon", "simpson", "Host", "Site", "Year")], distance = "bray")
##
## global Multidimensional Scaling using monoMDS
##
## Data: wisconsin(sqrt(abund_OTU_transvers_Cx[, !colnames(abund_OTU_transvers_Cx) %in% c("spe_richness", "nbr_reads", "Sample", "nbr_mos", "shannon", "simpson", "Host", "Site", "Year")]))
## Distance: bray
##
## Dimensions: 2
## Stress: 0.1460129
## Stress type 1, weak ties
## Two convergent solutions found after 20 tries
## Scaling: centring, PC rotation, halfchange scaling
## Species: expanded scores based on 'wisconsin(sqrt(abund_OTU_transvers_Cx[, !colnames(abund_OTU_transvers_Cx) %in% c("spe_richness", "nbr_reads", "Sample", "nbr_mos", "shannon", "simpson", "Host", "Site", "Year")]))'
plot(spe.nmds)
# extract NMDS scores (x and y coordinates)
data.scores = as.data.frame(scores(spe.nmds))
# add columns to data frame
data.scores$Host = abund_OTU_transvers_Cx$Host
data.scores$Site = abund_OTU_transvers_Cx$Site
data.scores$Year = abund_OTU_transvers_Cx$Year
head(data.scores)
# nMDS
nMDS_plot = ggplot(data.scores, aes(x = NMDS1, y = NMDS2)) +
geom_point(aes(shape = Site, colour = Host), size = 4) +
theme(axis.text.y = element_text(colour = "black", size = 12, face = "bold"),
axis.text.x = element_text(colour = "black", face = "bold", size = 12),
legend.text = element_text(size = 6, face ="bold", colour ="black"),
legend.position = "right", axis.title.y = element_text(face = "bold", size = 14),
axis.title.x = element_text(face = "bold", size = 14, colour = "black"),
legend.title = element_text(size = 6, colour = "black", face = "bold"),
panel.background = element_blank(), panel.border = element_rect(colour = "black", fill = NA, size = 1.2),
legend.key=element_blank(), plot.title = element_text(color="black", size=14, face="bold.italic")) +
ggtitle("nMDS / Bray curtis") +
labs(x = "NMDS1", y = "NMDS2") +
scale_colour_manual(values = c("#F8766D", "#00BFC4", "#009E73")) + theme_bw() + theme(legend.position = "none")
nMDS_plot
Example PCoA/Bray-Curtis, SENEGAL data
# Compute the Bray-Curtis distance matrix
spe.bray<-vegdist(abund_OTU_transvers_Cx[,!colnames(abund_OTU_transvers_Cx) %in% c("spe_richness", "nbr_reads","Sample","nbr_mos", "shannon", "simpson", "Host", "Site", "Year")], method="bray")
spe.bray
## PD4 PD5 PF4 PF5 PK4 PK5 TD4
## PD5 0.9682807
## PF4 0.6971984 0.7958871
## PF5 0.9787408 0.8583456 0.8322459
## PK4 0.7646237 0.7191033 0.4387796 0.8469308
## PK5 0.6974376 0.8814373 0.2574623 0.9153413 0.3903010
## TD4 0.7712603 0.9418620 0.8415766 0.9850261 0.8489397 0.8278310
## TD5 0.8572858 0.8801325 0.8560616 0.9534993 0.9107081 0.8569102 0.3607216
## TF4 0.9244705 0.8689642 0.8766030 0.8463445 0.9546617 0.8765134 0.8418737
## TF5 0.9819942 0.8258135 0.8555222 0.7326594 0.9550218 0.8953270 0.7473699
## TK4 0.9743679 0.8316359 0.8274181 0.6991281 0.9472738 0.8356362 0.8280006
## TK5 0.7865964 0.6833520 0.7966039 0.8868651 0.7249280 0.7483579 0.7830533
## TD5 TF4 TF5 TK4
## PD5
## PF4
## PF5
## PK4
## PK5
## TD4
## TD5
## TF4 0.6887704
## TF5 0.6833517 0.5436308
## TK4 0.7619412 0.6337373 0.6705511
## TK5 0.7082158 0.7310793 0.8126667 0.5172863
Building the PCoA:
pcoa <- cmdscale(spe.bray, k=3, eig = T, add = T) # with k=3 three axes are selected, but only 2 are shown in the PCoA plot (for simplicity)
pcoa %>% head()
## $points
## [,1] [,2] [,3]
## PD4 0.31898285 0.284917152 -0.095818760
## PD5 0.01778839 -0.290512563 0.444249650
## PF4 0.41356949 -0.069678453 -0.147139871
## PF5 -0.13819115 -0.442468394 -0.305953139
## PK4 0.48507464 -0.104818579 0.061590105
## PK5 0.44285188 0.009925664 -0.120779002
## TD4 -0.11927073 0.499743102 -0.006005794
## TD5 -0.26286506 0.413150765 0.057711220
## TF4 -0.37195959 -0.024706429 -0.096539397
## TF5 -0.40038408 -0.060544533 -0.194061391
## TK4 -0.32557668 -0.189326692 0.038165364
## TK5 -0.06001995 -0.025681040 0.364581016
##
## $eig
## [1] 1.215206e+00 8.385110e-01 5.248996e-01 4.316936e-01 3.817891e-01
## [6] 3.325263e-01 1.627872e-01 1.315949e-01 7.299114e-02 2.631405e-02
## [11] 8.865175e-17 -2.351641e-17
##
## $x
## NULL
##
## $ac
## [1] 0.06110151
##
## $GOF
## [1] 0.6261342 0.6261342
positions <- pcoa$points
colnames(positions) <- c("pcoa1", "pcoa2", "pcoa3")
positions %>% head()
## pcoa1 pcoa2 pcoa3
## PD4 0.31898285 0.284917152 -0.09581876
## PD5 0.01778839 -0.290512563 0.44424965
## PF4 0.41356949 -0.069678453 -0.14713987
## PF5 -0.13819115 -0.442468394 -0.30595314
## PK4 0.48507464 -0.104818579 0.06159010
## PK5 0.44285188 0.009925664 -0.12077900
percent_explained <- 100 * pcoa$eig / sum(pcoa$eig)
pretty_pe <- format(round(percent_explained[1:3], digits =1), nsmall = 1, trim = T)
labs <- c(glue("PCo1 ({pretty_pe[1]}%)"),
glue("PCo2 ({pretty_pe[2]}%)"),
glue("PCo3 ({pretty_pe[3]}%)"))
positions <- as.data.frame(positions)
positions$Host <- abund_OTU_transvers_Cx$Host
positions$Site <- abund_OTU_transvers_Cx$Site
positions$Year <- abund_OTU_transvers_Cx$Year
positions %>%
as_tibble(rownames="samples") %>%
ggplot(aes(x=pcoa1, y=pcoa2, color=Host, shape=Site)) +
geom_point(size = 4) +
labs(x = labs[1], y = labs[2])
Create a scree plot to see how much variation each axis captures:
Here, the first two axes account for almost 50% of the total variation (add more axes to capture more variation).
tibble(pe = cumsum(percent_explained),
axis = 1:length(percent_explained)) %>%
ggplot(aes(x=axis, y=pe)) +
geom_line()
Building a 3D PCoA to visualise more than 2 axes more easily (adding the z dimension):
positions <- positions %>%
mutate(Host = factor(Host, levels=c("C. poicilipes", "C. tritaeniorhynchus")),
Host_color = case_when(Host == "C. poicilipes" ~ "blue",
Host == "C. tritaeniorhynchus" ~ "red",
TRUE ~ NA_character_))
plot3d(x=positions$pcoa1, y=positions$pcoa2, z=positions$pcoa3,
xlab = labs[1], ylab = labs[2], zlab = labs[3],
col=positions$Host_color, type="s", size=1)
2.6: Going further
2.6.1: Differences between dissimilarity indices (beta diversity)
Review the definitions of the ecological indices in section "2.4.4: Exploring beta diversity"
If you have questions about this part, ask Serafin or Côme.
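Before running the full distance tables, the behaviour of the three indices can be checked by hand on two toy abundance vectors (hypothetical counts): Jaccard and Sorensen use presence/absence only, while Bray-Curtis uses the abundances themselves.

```r
a <- c(10, 0, 4, 0, 2)   # toy library A
b <- c( 8, 5, 0, 0, 2)   # toy library B

# Presence/absence components
shared <- sum(a > 0 & b > 0)   # species present in both libraries
total  <- sum(a > 0 | b > 0)   # species present in at least one library
jaccard  <- 1 - shared / total
sorensen <- 1 - 2 * shared / (sum(a > 0) + sum(b > 0))

# Bray-Curtis dissimilarity uses the abundances
bray <- sum(abs(a - b)) / sum(a + b)

c(jaccard = jaccard, sorensen = sorensen, bray = bray)
```

These hand computations match vegdist(..., method="jaccard", binary=T), vegdist(..., method="bray", binary=T) and vegdist(..., method="bray") on the same pair of rows.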
## Jaccard, Bray-Curtis and Sorensen indexes
jacc_tab <- vegdist(abund_OTU_transvers_Cx[,!colnames(abund_OTU_transvers_Cx) %in% c("spe_richness", "nbr_reads", "Sample", "nbr_mos", "shannon", "simpson", "Host", "Site", "Year")], method="jaccard", binary = T)
Bray_tab <- vegdist(abund_OTU_transvers_Cx[,!colnames(abund_OTU_transvers_Cx) %in% c("spe_richness", "nbr_reads", "Sample", "nbr_mos", "shannon", "simpson", "Host", "Site", "Year")], method="bray")
Sor_tab <- vegdist(abund_OTU_transvers_Cx[,!colnames(abund_OTU_transvers_Cx) %in% c("spe_richness", "nbr_reads", "Sample", "nbr_mos", "shannon", "simpson", "Host", "Site", "Year")], method="bray", binary = T)
# Conversion to matrices
jacc_tab2 <- as.matrix(jacc_tab)
Bray_tab2 <- as.matrix(Bray_tab)
Sor_tab2 <- as.matrix(Sor_tab)
jacc_tab3 <- melt(as.matrix(jacc_tab), varnames = c("row", "col"))
Bray_tab3 <- melt(as.matrix(Bray_tab), varnames = c("row", "col"))
Sor_tab3 <- melt(as.matrix(Sor_tab), varnames = c("row", "col"))
# Save as csv files
write.csv2(jacc_tab2, file = "jaccard_OTU.csv")
write.csv2(Bray_tab2, file = "BrayCurtis_OTU.csv")
write.csv2(Sor_tab2, file = "Sorensen_OTU.csv")
write.csv2(jacc_tab3, file = "jaccard_OTU_2.csv")
write.csv2(Bray_tab3, file = "BrayCurtis_OTU_2.csv")
write.csv2(Sor_tab3, file = "Sorensen_OTU_2.csv")
## the two files have been merged in dist_matrixes_Jac_Bray.xlsx (reports folder)
dist_matrixes_Jac_Bray_OTU <- read_excel(Dist_table.file) # in reports folder
# Fig. with all points / all library comparisons
fig_betaindexes <- dist_matrixes_Jac_Bray_OTU %>%
filter(value != 0) %>%
ggplot() +
geom_jitter(aes(x = row, y = value, colour = col), width = 0.1) +
xlab("") +
ylab("") +
theme_bw() +
facet_wrap(~ index)
fig_betaindexes <- fig_betaindexes +
theme(axis.text.x = element_text(size = 9, angle = 25, hjust = 1, debug = FALSE),
legend.title = element_blank())
# statistical analysis of beta indexes: compare BC and Jaccard/Sorensen depending on species
dist_Jac_Bray2 <- read_excel(Dist_matrix.file) # in reports folder
# change the dataframe structure
dist_Jac_Bray3 <- dist_Jac_Bray2 %>%
pivot_longer(., starts_with(c("P", "T")),
names_to = "sample",
values_to = "estimate",
values_drop_na = TRUE) %>%
separate(., col = "sample", into = c("Nothing1", "Host1", "habitat1", "year1"),
sep = "", remove = FALSE) %>%
separate(., col = "row", into = c("Nothing2", "Host2", "habitat2", "year2"),
sep = "", remove = FALSE)
dist_Jac_Bray3 <- dist_Jac_Bray3[,! colnames(dist_Jac_Bray3) %in% c("Nothing1", "Nothing2")]
dist_Jac_Bray3 <- dist_Jac_Bray3 %>%
mutate(Host = if_else(Host1 == Host2, "same", "different"))
dist_Jac_Bray3 <- dist_Jac_Bray3 %>%
mutate(Habitat = if_else(habitat1 == habitat2, "same", "different"))
dist_Jac_Bray3 <- dist_Jac_Bray3 %>%
mutate(Year = if_else(year1 == year2, "same", "different"))
dist_Jac <- dist_Jac_Bray3 %>% filter(index == "jaccard")
dist_BC <- dist_Jac_Bray3 %>% filter(index == "bray-curtis")
dist_sor <- dist_Jac_Bray3 %>% filter(index == "sorensen")
# statistical tests
wilcox.test(dist_Jac$estimate ~ dist_Jac$Host) # p-value = 0.01467
##
## Wilcoxon rank sum test with continuity correction
##
## data: dist_Jac$estimate by dist_Jac$Host
## W = 730, p-value = 0.01467
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(dist_BC$estimate ~ dist_BC$Host) # p-value = 2.877e-05
##
## Wilcoxon rank sum exact test
##
## data: dist_BC$estimate by dist_BC$Host
## W = 854, p-value = 2.877e-05
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(dist_sor$estimate ~ dist_sor$Host) # p-value = 0.01467
##
## Wilcoxon rank sum test with continuity correction
##
## data: dist_sor$estimate by dist_sor$Host
## W = 730, p-value = 0.01467
## alternative hypothesis: true location shift is not equal to 0
kruskal.test(dist_Jac$estimate ~ dist_Jac$Habitat) # p-value = 0.1116
##
## Kruskal-Wallis rank sum test
##
## data: dist_Jac$estimate by dist_Jac$Habitat
## Kruskal-Wallis chi-squared = 2.5314, df = 1, p-value = 0.1116
kruskal.test(dist_BC$estimate ~ dist_BC$Habitat) # p-value = 0.4986
##
## Kruskal-Wallis rank sum test
##
## data: dist_BC$estimate by dist_BC$Habitat
## Kruskal-Wallis chi-squared = 0.45792, df = 1, p-value = 0.4986
kruskal.test(dist_Jac$estimate ~ dist_Jac$Year) # p-value = 0.7281
##
## Kruskal-Wallis rank sum test
##
## data: dist_Jac$estimate by dist_Jac$Year
## Kruskal-Wallis chi-squared = 0.12091, df = 1, p-value = 0.7281
kruskal.test(dist_BC$estimate ~ dist_BC$Year) # p-value = 0.3882
##
## Kruskal-Wallis rank sum test
##
## data: dist_BC$estimate by dist_BC$Year
## Kruskal-Wallis chi-squared = 0.74444, df = 1, p-value = 0.3882
# Fig. compare indexes depending on shared Host or Habitat
fig_betaindexes_Host <- dist_Jac_Bray3 %>%
ggplot(aes(x = Host, y = estimate)) +
geom_jitter(aes(colour = if_else(Host == "same", Host1, "black")), width = 0.1) +
scale_color_manual(name = "Host1", values = c("black", "blue", "red", "purple")) +
xlab("") +
ylab("") +
theme_bw() +
facet_wrap(~ index) +
geom_signif(comparisons = list(c("different", "same")), test = "wilcox.test", map_signif_level=TRUE, tip_length = 0, col="black")
fig_betaindexes_Host <- fig_betaindexes_Host + theme(legend.position = "none") +
ggtitle("Host")
fig_betaindexes_Habitat <- dist_Jac_Bray3 %>%
ggplot(aes(x = Habitat, y = estimate)) +
geom_jitter(aes(colour = if_else(Habitat == "same", habitat1, "black")), width = 0.1) +
scale_color_manual(name = "habitat1", values = c("black", "blue", "red", "purple")) +
xlab("") +
ylab("") +
theme_bw() +
facet_wrap(~ index) +
geom_signif(comparisons = list(c("different", "same")), test = "wilcox.test", map_signif_level=TRUE, tip_length = 0, col="black")
fig_betaindexes_Habitat <- fig_betaindexes_Habitat + theme(legend.position = "none") +
ggtitle("Habitat")
# just Host BC vs Jac
fig_betaindexes_Host_dif <- dist_Jac_Bray3 %>%
filter(Host %in% "different") %>%
ggplot(aes(x = index, y = estimate)) +
geom_boxplot(fill='#d1cfcf', color="black") +
geom_jitter(aes(colour = if_else(Host == "same", Host1, "black")), width = 0.1, size = 2) +
scale_color_manual(name = "Host1", values = c("black", "blue", "red", "purple")) +
xlab("") +
ylab("") +
theme_bw() +
# geom_hline(yintercept=0.5, linetype="dashed",
# color = "red", size=1) +
theme(axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12))
fig_betaindexes_Host_dif <- fig_betaindexes_Host_dif + theme(legend.position = "none") +
ggtitle("Host_dif")
fig_betaindexes/(fig_betaindexes_Host|fig_betaindexes_Host_dif) +
plot_annotation(tag_levels = 'A')
# Median Bray-Curtis
dist_Jac_Bray3 %>%
filter(Host %in% "different") %>%
filter(index %in% "bray-curtis") %>%
summarise(index = median(estimate, na.rm = TRUE)) # median = 0.8569679
# Median Jaccard
dist_Jac_Bray3 %>%
filter(Host %in% "different") %>%
filter(index %in% "jaccard") %>%
summarise(index = median(estimate, na.rm = TRUE)) # median = 0.6545107
# Median Sorensen
dist_Jac_Bray3 %>%
filter(Host %in% "different") %>%
filter(index %in% "sorensen") %>%
summarise(index = median(estimate, na.rm = TRUE)) # median = 0.4864595
2.6.2: Identifying indicator species/markers in our data
Indicator species are:
“A species whose status provides information on the overall condition of the ecosystem and of other species in that ecosystem. They reflect the quality and changes in environmental conditions as well as aspects of community composition.” - United Nations Environment Programme (1996)
Tutorial here
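As a rough intuition for the `stat` column reported below: with `func = "r.g"`, multipatt uses a group-equalized point-biserial correlation between a species' abundance and group membership. A sketch of the uncorrected version of that idea, on hypothetical toy data:

```r
# Toy data (hypothetical): one species, 10 samples, two groups of 5;
# the species is clearly more abundant in group A
set.seed(2)
abund_sp <- c(rpois(5, 10), rpois(5, 1))   # abundances, group A then group B
member_A <- rep(c(1, 0), each = 5)         # 1 = sample belongs to group A
cor(abund_sp, member_A)                    # point-biserial correlation
```

A positive correlation means the species tracks group A; multipatt additionally corrects for unequal group sizes and assesses significance by permutation.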
abund = abund_OTU_transvers_Cx[,1:99]
host = abund_OTU_transvers_Cx$Host
inv = multipatt(abund, host, func = "r.g", control = how(nperm=9999))
summary(inv)
##
## Multilevel pattern analysis
## ---------------------------
##
## Association function: r.g
## Significance level (alpha): 0.05
##
## Total number of species: 99
## Selected number of species: 16
## Number of species associated to 1 group: 16
##
## List of species associated to each combination:
##
## Group C. poicilipes #sps. 6
## stat p.value
## Culex vishnui subgroup totivirus 0.735 0.0174 *
## Culex tritaeniorhynchus Negev-like virus 0.673 0.0050 **
## Cordoba virus 0.595 0.0326 *
## Broome reo-like virus 1 0.555 0.0174 *
## Hubei diptera virus 20 0.538 0.0174 *
## Beihai narna-like virus 22 0.486 0.0100 **
##
## Group C. tritaeniorhynchus #sps. 10
## stat p.value
## Culex associated luteo like virus 0.695 0.0192 *
## Fitzroy Crossing toti-like virus 2 0.683 0.0044 **
## Bat sobemovirus 0.611 0.0377 *
## Broome luteo-like virus 1 0.534 0.0276 *
## Xanthi chryso-like virus 0.522 0.0111 *
## Tritaeniorhynchus merhavirus 0.510 0.0151 *
## Soufli chryso-like virus 0.506 0.0280 *
## Hubei virga-like virus 2 0.473 0.0054 **
## Alexandroupolis virga-like virus 0.465 0.0196 *
## Culex mononega-like virus 2 0.444 0.0428 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Use boxplots to show how the distributions of these statistically significant indicator species differ between groups.
abund_indic = melt(abund_OTU_transvers_Cx[,c(1:99,104)], id = "Host")
ggplot(abund_indic, aes(x = variable, y = value, fill = Host)) +
geom_boxplot(colour = "black", position = position_dodge(0.5)) +
theme(legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 10, face = "bold"), legend.position = "right",
axis.text.x = element_text(face = "bold",colour = "black", size = 12, angle=90),
axis.text.y = element_text(face = "bold", size = 12, colour = "black"),
axis.title.y = element_text(face = "bold", size = 14, colour = "black"),
panel.background = element_blank(),
panel.border = element_rect(fill = NA, colour = "black"),
legend.key=element_blank()) +
labs(x= "", y = "Relative Abundance (%)", fill = "Host") +
scale_fill_manual(values = c("steelblue2", "steelblue4"))
abund_indic_spec = data.frame(Host = abund_OTU_transvers_Cx$Host,
`Culex vishnui subgroup totivirus` = abund_OTU_transvers_Cx$`Culex vishnui subgroup totivirus`,
`Culex tritaeniorhynchus Negev-like virus` = abund_OTU_transvers_Cx$`Culex tritaeniorhynchus Negev-like virus`,
`Cordoba virus` = abund_OTU_transvers_Cx$`Cordoba virus`,
`Broome reo-like virus 1` = abund_OTU_transvers_Cx$`Broome reo-like virus 1`,
`Hubei diptera virus 20` = abund_OTU_transvers_Cx$`Hubei diptera virus 20`,
`Beihai narna-like virus 22` = abund_OTU_transvers_Cx$`Beihai narna-like virus 22`,
`Culex associated luteo like virus` = abund_OTU_transvers_Cx$`Culex associated luteo like virus`,
`Fitzroy Crossing toti-like virus 2` = abund_OTU_transvers_Cx$`Fitzroy Crossing toti-like virus 2`,
`Bat sobemovirus` = abund_OTU_transvers_Cx$`Bat sobemovirus`,
`Broome luteo-like virus 1` = abund_OTU_transvers_Cx$`Broome luteo-like virus 1`,
`Xanthi chryso-like virus` = abund_OTU_transvers_Cx$`Xanthi chryso-like virus`,
`Tritaeniorhynchus merhavirus` = abund_OTU_transvers_Cx$`Tritaeniorhynchus merhavirus`,
`Soufli chryso-like virus` = abund_OTU_transvers_Cx$`Soufli chryso-like virus`,
`Hubei virga-like virus 2` = abund_OTU_transvers_Cx$`Hubei virga-like virus 2`,
`Alexandroupolis virga-like virus` = abund_OTU_transvers_Cx$`Alexandroupolis virga-like virus`,
`Culex mononega-like virus 2` = abund_OTU_transvers_Cx$`Culex mononega-like virus 2`)
abund_indic_spec_m = melt(abund_indic_spec, id = "Host")
ggplot(abund_indic_spec_m, aes(x = variable, y = value, fill = Host)) +
geom_boxplot(colour = "black", position = position_dodge(0.5)) +
geom_vline(xintercept = 1.5:15.5, colour = "grey85", size = 1.2) +
theme(legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 10, face = "bold"), legend.position = "right",
axis.text.x = element_text(face = "bold",colour = "black", size = 12, angle=90),
axis.text.y = element_text(face = "bold", size = 12, colour = "black"),
axis.title.y = element_text(face = "bold", size = 14, colour = "black"),
panel.background = element_blank(),
panel.border = element_rect(fill = NA, colour = "black"),
legend.key=element_blank()) +
labs(x= "", y = "Relative Abundance (%)", fill = "Host") +
scale_fill_manual(values = c("steelblue2", "steelblue4"))
2.6.3: Betadisper to test the homogeneity of multivariate dispersions
Betadisper tests whether two or more groups (for example, grazed and ungrazed sites) are homogeneously dispersed with respect to their species composition in the studied samples. It can show whether one group has more compositional variance than another. Checking homogeneity of dispersion among groups is also highly advisable before testing whether the groups differ in composition, which is what adonis tests.
Betadisper first calculates the average distance of each group member to its group centroid in multivariate space (derived from a distance matrix). An ANOVA is then run to test whether the dispersions (variances) of the groups differ.
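That ANOVA step is not shown in the code below; a minimal sketch on synthetic data illustrates it with `permutest()` (assuming the vegan package, which also provides `vegdist` and `betadisper`, is installed):

```r
library(vegan)                                     # vegdist, betadisper, permutest
# Toy community matrix (hypothetical): 8 samples x 5 "OTUs", two groups
set.seed(1)
comm <- matrix(rpois(40, lambda = 5), nrow = 8)
grp  <- factor(rep(c("A", "B"), each = 4))
d    <- vegdist(comm)                              # Bray-Curtis dissimilarities
mod_toy <- betadisper(d, grp)                      # distances to group centroids
permutest(mod_toy, permutations = 999)             # permutation test on dispersions
```

A non-significant p-value here supports the homogeneity-of-dispersion assumption behind adonis; `anova(mod_toy)` gives the parametric equivalent.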
# Host :
# Bray-Curtis distances between samples (vegdist and betadisper come from the vegan package)
library("vegan")
dis <- vegdist(abund_OTU_transvers_Cx[,!colnames(abund_OTU_transvers_Cx) %in% c("spe_richness", "nbr_reads", "Sample", "nbr_mos", "shannon", "simpson", "Host", "Site", "Year")])
# Calculate multivariate dispersions
mod <- betadisper(dis, abund_OTU_transvers_Cx$Host)
mod
##
## Homogeneity of multivariate dispersions
##
## Call: betadisper(d = dis, group = abund_OTU_transvers_Cx$Host)
##
## No. of Positive Eigenvalues: 10
## No. of Negative Eigenvalues: 1
##
## Average distance to median:
## C. poicilipes C. tritaeniorhynchus
## 0.4643 0.4500
##
## Eigenvalues for PCoA axes:
## (Showing 8 of 11 eigenvalues)
## PCoA1 PCoA2 PCoA3 PCoA4 PCoA5 PCoA6 PCoA7 PCoA8
## 1.0978 0.7530 0.4624 0.3755 0.3272 0.2831 0.1294 0.1008
# Visualization of the multivariate homogeneity of group dispersions
# extract the centroids and the site points in multivariate space.
centroids <- data.frame(grps=rownames(mod$centroids),data.frame(mod$centroids))
vectors <- data.frame(group=mod$group,data.frame(mod$vectors))
# to create the lines from the centroids to each point we will put it in a format that ggplot can handle
seg.data<-cbind(vectors[,1:3],centroids[rep(1:nrow(centroids),as.data.frame(table(vectors$group))$Freq),2:3])
names(seg.data)<-c("group","v.PCoA1","v.PCoA2","PCoA1","PCoA2")
# create the convex hulls of the outermost points
grp1.hull<-seg.data[seg.data$group=="C. poicilipes",1:3][chull(seg.data[seg.data$group=="C. poicilipes",2:3]),]
grp2.hull<-seg.data[seg.data$group=="C. tritaeniorhynchus",1:3][chull(seg.data[seg.data$group=="C. tritaeniorhynchus",2:3]),]
all.hull<-rbind(grp1.hull,grp2.hull)
# Graphic
panel.a <- ggplot() +
geom_point(data=centroids[1,1:3], aes(x=PCoA1,y=PCoA2),size=4,colour="red",shape=16) +
geom_point(data=seg.data[1:6,], aes(x=v.PCoA1,y=v.PCoA2),size=2,shape=16) +
labs(title="C. poicilipes",x="",y="") +
coord_cartesian(xlim = c(-0.5,0.5), ylim = c(-0.5,0.5)) +
theme_bw() +
theme(legend.position="none")
panel.b <- ggplot() +
geom_point(data=centroids[2,1:3], aes(x=PCoA1,y=PCoA2),size=4,colour="red",shape=17) +
geom_point(data=seg.data[7:12,], aes(x=v.PCoA1,y=v.PCoA2),size=2,shape=17) +
labs(title="C. tritaeniorhynchus",x="",y="") +
coord_cartesian(xlim = c(-0.5,0.5), ylim = c(-0.5,0.5)) +
theme_bw() +
theme(legend.position="none")
panel.c <- ggplot() +
geom_point(data=centroids[,1:3], aes(x=PCoA1,y=PCoA2,shape=grps),size=4,colour="red") +
geom_point(data=seg.data, aes(x=v.PCoA1,y=v.PCoA2,shape=group),size=2) +
labs(title="All",x="",y="") +
coord_cartesian(xlim = c(-0.5,0.5), ylim = c(-0.5,0.5)) +
theme_bw() +
theme(legend.position="none")
grid.arrange(panel.a,panel.b,panel.c,nrow=1)
# Site :
# Calculate multivariate dispersions
mod <- betadisper(dis, abund_OTU_transvers_Cx$Site)
mod
##
## Homogeneity of multivariate dispersions
##
## Call: betadisper(d = dis, group = abund_OTU_transvers_Cx$Site)
##
## No. of Positive Eigenvalues: 10
## No. of Negative Eigenvalues: 1
##
## Average distance to median:
## Diama Ferlo KMS
## 0.4793 0.4766 0.4357
##
## Eigenvalues for PCoA axes:
## (Showing 8 of 11 eigenvalues)
## PCoA1 PCoA2 PCoA3 PCoA4 PCoA5 PCoA6 PCoA7 PCoA8
## 1.0978 0.7530 0.4624 0.3755 0.3272 0.2831 0.1294 0.1008
# Visualization of the multivariate homogeneity of group dispersions
# extract the centroids and the site points in multivariate space.
centroids <- data.frame(grps=rownames(mod$centroids),data.frame(mod$centroids))
vectors <- data.frame(group=mod$group,data.frame(mod$vectors))
# to create the lines from the centroids to each point we will put it in a format that ggplot can handle
seg.data <- cbind(vectors[,1:3],centroids[rep(1:nrow(centroids),as.data.frame(table(vectors$group))$Freq),2:3])
names(seg.data) <- c("group","v.PCoA1","v.PCoA2","PCoA1","PCoA2")
# create the convex hulls of the outermost points
grp1.hull <- seg.data[seg.data$group=="Diama",1:3][chull(seg.data[seg.data$group=="Diama",2:3]),]
grp2.hull <- seg.data[seg.data$group=="Ferlo",1:3][chull(seg.data[seg.data$group=="Ferlo",2:3]),]
grp3.hull <- seg.data[seg.data$group=="KMS",1:3][chull(seg.data[seg.data$group=="KMS",2:3]),]
all.hull <- rbind(grp1.hull,grp2.hull,grp3.hull)
# Graphic
panel.a <- ggplot() +
geom_point(data=centroids[1,1:3], aes(x=PCoA1,y=PCoA2),size=4,colour="red",shape=16) +
geom_point(data=seg.data[c(1:2,7:8),], aes(x=v.PCoA1,y=v.PCoA2),size=2,shape=16) +
labs(title="Diama",x="",y="") +
coord_cartesian(xlim = c(-0.5,0.5), ylim = c(-0.5,0.5)) +
theme_bw() +
theme(legend.position="none")
panel.b <- ggplot() +
geom_point(data=centroids[2,1:3], aes(x=PCoA1,y=PCoA2),size=4,colour="red",shape=17) +
geom_point(data=seg.data[c(3:4,9:10),], aes(x=v.PCoA1,y=v.PCoA2),size=2,shape=17) +
labs(title="Ferlo",x="",y="") +
coord_cartesian(xlim = c(-0.5,0.5), ylim = c(-0.5,0.5)) +
theme_bw() +
theme(legend.position="none")
panel.c <- ggplot() +
geom_point(data=centroids[3,1:3], aes(x=PCoA1,y=PCoA2),size=4,colour="red",shape=15) + # third centroid (KMS), not the second
geom_point(data=seg.data[c(5:6,11:12),], aes(x=v.PCoA1,y=v.PCoA2),size=2,shape=15) +
labs(title="KMS",x="",y="") +
coord_cartesian(xlim = c(-0.5,0.5), ylim = c(-0.5,0.5)) +
theme_bw() +
theme(legend.position="none")
panel.d <- ggplot() +
geom_point(data=centroids[,1:3], aes(x=PCoA1,y=PCoA2,shape=grps),size=4,colour="red") +
geom_point(data=seg.data, aes(x=v.PCoA1,y=v.PCoA2,shape=group),size=2) +
labs(title="All",x="",y="") +
coord_cartesian(xlim = c(-0.5,0.5), ylim = c(-0.5,0.5)) +
theme_bw() +
theme(legend.position="none")
grid.arrange(panel.a,panel.b,panel.c,panel.d,nrow=1)
# Year :
# Calculate multivariate dispersions
mod <- betadisper(dis, abund_OTU_transvers_Cx$Year)
mod
##
## Homogeneity of multivariate dispersions
##
## Call: betadisper(d = dis, group = abund_OTU_transvers_Cx$Year)
##
## No. of Positive Eigenvalues: 10
## No. of Negative Eigenvalues: 1
##
## Average distance to median:
## 2014 2015
## 0.5281 0.5307
##
## Eigenvalues for PCoA axes:
## (Showing 8 of 11 eigenvalues)
## PCoA1 PCoA2 PCoA3 PCoA4 PCoA5 PCoA6 PCoA7 PCoA8
## 1.0978 0.7530 0.4624 0.3755 0.3272 0.2831 0.1294 0.1008
# Visualization of the multivariate homogeneity of group dispersions
# extract the centroids and the site points in multivariate space.
centroids <- data.frame(grps=rownames(mod$centroids),data.frame(mod$centroids))
vectors <- data.frame(group=mod$group,data.frame(mod$vectors))
# to create the lines from the centroids to each point we will put it in a format that ggplot can handle
seg.data <- cbind(vectors[,1:3],centroids[rep(1:nrow(centroids),as.data.frame(table(vectors$group))$Freq),2:3])
names(seg.data) <- c("group","v.PCoA1","v.PCoA2","PCoA1","PCoA2")
# create the convex hulls of the outermost points
grp1.hull <- seg.data[seg.data$group=="2014",1:3][chull(seg.data[seg.data$group=="2014",2:3]),]
grp2.hull <- seg.data[seg.data$group=="2015",1:3][chull(seg.data[seg.data$group=="2015",2:3]),]
all.hull <- rbind(grp1.hull,grp2.hull)
# Graphic
panel.a <- ggplot() +
geom_point(data=centroids[1,1:3], aes(x=PCoA1,y=PCoA2),size=4,colour="red",shape=16) +
geom_point(data=seg.data[c(1,3,5,7,9,11),], aes(x=v.PCoA1,y=v.PCoA2),size=2,shape=16) +
labs(title="2014",x="",y="") +
coord_cartesian(xlim = c(-0.5,0.5), ylim = c(-0.5,0.5)) +
theme_bw() +
theme(legend.position="none")
panel.b <- ggplot() +
geom_point(data=centroids[2,1:3], aes(x=PCoA1,y=PCoA2),size=4,colour="red",shape=17) +
geom_point(data=seg.data[c(2,4,6,8,10,12),], aes(x=v.PCoA1,y=v.PCoA2),size=2,shape=17) +
labs(title="2015",x="",y="") +
coord_cartesian(xlim = c(-0.5,0.5), ylim = c(-0.5,0.5)) +
theme_bw() +
theme(legend.position="none")
panel.c <- ggplot() +
geom_point(data=centroids[,1:3], aes(x=PCoA1,y=PCoA2,shape=grps),size=4,colour="red") +
geom_point(data=seg.data, aes(x=v.PCoA1,y=v.PCoA2,shape=group),size=2) +
labs(title="All",x="",y="") +
coord_cartesian(xlim = c(-0.5,0.5), ylim = c(-0.5,0.5)) +
theme_bw() +
theme(legend.position="none")
grid.arrange(panel.a,panel.b,panel.c,nrow=1)
2.6.4: Mapping the sampling sites of the data
More information here:
Requires some specific libraries to be installed/loaded:
#install.packages("devtools") # install + load packages
#devtools::install_github("ropensci/rnaturalearthhires")
library("rnaturalearth")
library("ggspatial")
library("cowplot")
library("ggrepel")
Create the map of Senegal (for this study):
senegal_map <- ne_states(country = c("senegal", "gambia", "mauritania", "mali", "guinea bissau", "guinea"), returnclass = "sf") # load the countries neighbouring Senegal to draw a wider map
senegal_map
Create the world map:
# load world map
world <- ne_countries(scale = "medium", returnclass = "sf")
# create a new column flagging Senegal with a discrete value (1/0)
world$SEN <- as.numeric(world$name == "Senegal")
# plot world_map with senegal highlight
map_World <- ggplot() +
geom_sf(data = world, colour = "black", fill = ifelse(world$name == "Senegal", '#FCFC9C', '#ededeb')) + # highlight senegal
annotation_scale(location = "bl", width_hint = 0.2) + # put scale on map
coord_sf(xlim = c(-16, 50), ylim = c(-35, 35)) + # Zoom on the area of interest
theme(panel.background = element_rect(fill = "aliceblue"),
plot.background = element_blank()) + # put blue background (for sea)
#theme_void() +
annotate(geom = "text", x = c(-5,43), y = c(-14,-32), label = c("Atlantic Ocean", "Indian Ocean"),fontface = "italic", color = "grey22", size = 4) # annotate Ocean
map_World
Load a metadata file containing the geographic position of the sites (Latitude and Longitude):
SupData_merge <- read_excel(Map.file)
SupData_merge
Create the summary map combining the map of Senegal with an inset world map to locate Senegal at the global scale:
# plot senegal with point and world map in top right
ggdraw(ggplot() +
geom_sf(data = senegal_map, colour = "black", fill = ifelse(senegal_map$iso_a2 == "SN", '#FCFC9C', '#ededeb')) + # highlight senegal
theme(panel.background = element_rect(fill = "aliceblue")) + # put blue background (for sea)
geom_point(aes(x=Longitude, y=Latitude, group=Region), data=SupData_merge, size = 6) + # put point of region of interest
geom_text_repel(aes(x=Longitude, y=Latitude, label = Region), data=SupData_merge, size = 5, fontface = 2, box.padding = unit(1.5, "lines"), point.padding = unit(1.5, "lines"), segment.size = 1) + # put lines for point
coord_sf(xlim = c(-18, -10), ylim = c(11.8, 17)) + # zoom coord
annotation_scale(location = "bl", width_hint = 0.2) + # put scale on map
annotation_north_arrow(location = "tl", which_north = "true",
height = unit(2, "cm"), width = unit(2, "cm"),
style = north_arrow_fancy_orienteering)) + # put arrow on map
draw_plot(map_World, width = 0.4, height = 0.4, x = 0.585, y = 0.59) # put world map on top right